“[Advanced Intro to AI Alignment] 2. What Values May an AI Learn? — 4 Key Problems” by Towards_Keeperhood
Description
2.1 Summary
In the last post, I introduced model-based RL, which is the frame we will use to analyze the alignment problem, and we learned that the critic is trained to predict reward.
I already briefly mentioned that the alignment problem is centrally about making the critic assign high value to outcomes we like and low value to outcomes we don’t like. In this post, we’re going to try to get some intuition for what values a critic may learn, and thereby also learn about some key difficulties of the alignment problem.
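(Not from the original post: a minimal sketch of what "the critic is trained to predict reward" could look like, assuming a PyTorch-style setup in which the critic sees some feature vector standing in for the model's internal thoughts and is regressed onto an observed reward signal. The feature and reward tensors here are synthetic placeholders, not the article's actual training setup.)

```python
import torch
import torch.nn as nn

class Critic(nn.Module):
    """Maps a (hypothetical) representation of the model's thoughts to a scalar value estimate."""
    def __init__(self, feature_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feature_dim, 64),
            nn.ReLU(),
            nn.Linear(64, 1),
        )

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        return self.net(features).squeeze(-1)

# Toy training loop: the critic's value estimate is regressed onto observed reward.
critic = Critic(feature_dim=16)
optimizer = torch.optim.Adam(critic.parameters(), lr=1e-3)

for step in range(100):
    features = torch.randn(32, 16)   # stand-in for "thought" features
    reward = features[:, 0]          # stand-in reward signal (e.g. human feedback)
    loss = nn.functional.mse_loss(critic(features), reward)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

Whatever features happen to correlate with reward in training data like this are what the critic ends up valuing, which is the generalization question the rest of the post examines.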
Section-by-section summary:
- 2.2 The Distributional Leap: The distributional leap is the shift from the training domain to the dangerous domain (where the AI could take over). We cannot test safety in that domain, so we need to predict how values generalize.
- 2.3 A Naive Training Strategy: We set up a toy example: a model-based RL chatbot trained on human feedback, where the critic learns to predict reward from the model's internal thoughts. This isn't meant as a good alignment strategy—it's a simplified setup for analysis.
- 2.4 What might the critic learn?: The critic learns aspects of the model's thoughts that correlate with reward. We analyze whether [...]
---
Outline:
(00:16 ) 2.1. Summary
(03:48 ) 2.2. The Distributional Leap
(05:26 ) 2.3. A Naive Training Strategy
(07:01 ) 2.3.1. How this relates to current AIs
(08:26 ) 2.4. What might the critic learn?
(09:55 ) 2.4.1. Might the critic learn to score honesty highly?
(12:35 ) 2.4.1.1. Aside: Contrast to the human value of honesty
(13:05 ) 2.5. Niceness is not optimal
(14:59 ) 2.6. Niceness is not (uniquely) simple
(16:02 ) 2.6.1. Anthropomorphic Optimism
(19:26 ) 2.6.2. Intuitions from looking at humans may mislead you
(21:12 ) 2.7. Natural Abstractions or Alienness?
(21:35 ) 2.7.1. Natural Abstractions
(23:15 ) 2.7.2. ... or Alienness?
(25:54 ) 2.8. Value extrapolation
(27:49 ) 2.8.1. Coherent Extrapolated Volition
(32:48 ) 2.9. Conclusion
The original text contained 11 footnotes which were omitted from this narration.
---
First published:
January 2nd, 2026
---
Narrated by TYPE III AUDIO.
---